[pve-devel] corosync problems

Discussion:

[pve-devel] corosync problems - need help

Alexandre DERUMIER

2014-09-14 06:18:09 UTC

Hi,

I have a corosync problem on my production cluster,
and I don't known how to debug.

Cluster is a 12 nodes cluster,
multicast is working fine

on this cluster, 2 nodes show "corosync [TOTEM ] Retransmit List"

All nodes are show
------------------
# cman_tool nodes
Node Sts Inc Joined Name
1 M 76636 2014-09-08 12:23:04 kvm6
2 M 76636 2014-09-08 12:23:04 kvm4
3 M 76636 2014-09-08 12:23:04 kvm3
4 M 76636 2014-09-08 12:23:04 kvm2
5 M 76636 2014-09-08 12:23:04 kvm5
6 M 76672 2014-09-12 16:52:08 kvm1
7 M 76636 2014-09-08 12:23:04 kvm8
8 M 76636 2014-09-08 12:23:04 kvm7
8 M 76636 2014-09-08 12:23:04 kvm9
10 M 76636 2014-09-08 12:23:04 kvm10
11 M 76944 2014-09-14 08:08:18 kvm11
12 M 4 2014-09-03 06:57:27 kvm12

I have quorum
--------------
#cman_tool status
Version: 6.2.0
Config Version: 12
Cluster Name: odiso
Cluster Id: 3337
Cluster Member: Yes
Cluster Generation: 76944
Membership state: Cluster-Member
Nodes: 12
Expected votes: 12
Total votes: 12
Node votes: 1
Quorum: 7
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: kvm12
Node ID: 12
Multicast addresses: 239.192.13.22
Node addresses: 10.3.94.59

But I can't write anything in pmxcfs on any node (read is ok)

with a lot erros like this
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32310
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32320
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32330

Any idea ?

I would like to try to change corosync window_size, but how can I do it online ?

(and /etc/init.d/cman stop is hanging)

Dietmar Maurer

2014-09-14 06:41:09 UTC

Permalink

Post by Alexandre DERUMIER
on this cluster, 2 nodes show "corosync [TOTEM ] Retransmit List"

What kernel do you run? 2.6.32 or 3.10.0?
What is different on those nodes? kernel, network cards?

Post by Alexandre DERUMIER
But I can't write anything in pmxcfs on any node (read is ok)
with a lot erros like this
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32310
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32320
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32330
Any idea ?

Does it help if you restart the cluster file system:

# service pve-cluster restart

Note: You also need to restart depending services afterwards:

# service pvedaemon restart
# service pveproxy restart
# service pvestatd restart

Post by Alexandre DERUMIER
I would like to try to change corosync window_size, but how can I do it online ?

On all nodes?

Post by Alexandre DERUMIER
(and /etc/init.d/cman stop is hanging)

I guess you already tried to reboot that node?

Alexandre DERUMIER

2014-09-14 07:05:45 UTC

Permalink

Post by Dietmar Maurer

Post by Dietmar Maurer
What kernel do you run? 2.6.32 or 3.10.0?

1 node 2.6.32 , 1 node 3.10

What is different on those nodes? kernel, network cards?

All nodes are same model, but I have 3 nodes with kernel 3.10 and 8 nodes with 2.6.32 kernel.
(I'm currently migrate all nodes to 3.10)

I have added 2 nodes (kvm11,kvm12) with 3.10 kernel 1 week ago (without any multicast problem)

Post by Dietmar Maurer

Post by Dietmar Maurer
I would like to try to change corosync window_size, but how can I do it online ?

On all nodes?

Yes, if possible. As I can't edit cluster.conf (read only), don't known how to inject it online.

Post by Dietmar Maurer

Post by Dietmar Maurer
(and /etc/init.d/cman stop is hanging)

I guess you already tried to reboot that node?

I can't reboot for now, it's a production node, and I can't live migrate as pmxcfs is read only.

I'll try to restart all services on all nodes to see if It's help

----- Mail original -----

De: "Dietmar Maurer" <***@proxmox.com>
À: "Alexandre DERUMIER" <***@odiso.com>, pve-***@pve.proxmox.com
Envoyé: Dimanche 14 Septembre 2014 08:41:09
Objet: RE: [pve-devel] corosync problems - need help

Post by Dietmar Maurer
on this cluster, 2 nodes show "corosync [TOTEM ] Retransmit List"

What kernel do you run? 2.6.32 or 3.10.0?
What is different on those nodes? kernel, network cards?

Post by Dietmar Maurer
But I can't write anything in pmxcfs on any node (read is ok)
with a lot erros like this
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32310
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32320
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32330
Any idea ?

Post by Dietmar Maurer
I would like to try to change corosync window_size, but how can I do it online ?

On all nodes?

Post by Dietmar Maurer
(and /etc/init.d/cman stop is hanging)

I guess you already tried to reboot that node?

Dietmar Maurer

2014-09-14 07:27:22 UTC